inequality follow
Bilevel Optimization over Saddle Points of Zero-Sum Markov Games
Zheng, Zihao, King, Irwin, Lu, Songtao
Reinforcement learning (RL) often has a hierarchical structure, where an upper-level (UL) learner selects model parameters and a lower-level (LL) decision-making process responds, naturally leading to a bilevel optimization problem. Most existing bilevel RL methods assume a single-policy LL Markov decision process (MDP), and therefore fail to capture competitive structures arising in applications such as incentive design, where multiple policies interact. We study bilevel optimization problems in which the LL problem is a regularized min-max zero-sum Markov game and the UL objective is optimized through the saddle-point equilibrium induced by the LL game. In this work, we propose penalty-augmented Nikaido-Isoda descent-ascent (PANDA), a penalty-based first-order policy-gradient method based on the Nikaido-Isoda function. By exploiting the min-max game structure, PANDA avoids computing UL hypergradients and does not require second-order information. We prove that PANDA converges to stationary points without convexity assumptions on either the UL or LL objectives. Moreover, PANDA reaches an $ε$-stationary point in $\tilde{\mathcal{O}}(ε^{-1})$ iterations with sample complexity $\tilde{\mathcal{O}}(ε^{-3})$, matching the best-known rates for bilevel RL with single-policy LL MDPs. Experiments demonstrate the superior performance of PANDA over closely related baselines.
Achieving $ε^{-2}$ Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions
In this paper, we establish last-iterate convergence rates for off-policy actor--critic methods in reinforcement learning. In particular, under a single-loop, single-timescale implementation and a broad class of policy updates, including approximate policy iteration and natural policy gradient methods, we prove the first $\tilde{\mathcal{O}}(ε^{-2})$ sample complexity guarantee for finding an $ε$-optimal policy under minimal assumptions, namely, the existence of a policy that induces an irreducible Markov chain. This stands in stark contrast to the existing literature, where an $\tilde{\mathcal{O}}(ε^{-2})$ sample complexity is achieved only through nested-loop updates and/or under strong, algorithm-dependent assumptions on the policies, such as uniform mixing and uniform exploration. Technically, to address the challenges posed by the coupled update equations arising from the single-loop implementation, as well as the potentially unbounded iterates induced by off-policy learning, our analysis is based on a coupled Lyapunov drift framework. Specifically, we establish a geometric convergence rate for the actor and an $\tilde{\mathcal{O}}(1/T)$ convergence rate for the critic, and combine the two Lyapunov drift inequalities through a cross-domination property. We believe this analytical framework is of independent interest and may be applicable to other coupled iterative algorithms with unbounded
Locally Near Optimal Piecewise Linear Regression in High Dimensions via Difference of Max-Affine Functions
This paper presents a parametric solution to piecewise linear regression through the Adaptive Block Gradient Descent (ABGD) algorithm. The heart of the method is the parametrization of piecewise linear functions as the difference of max-affine (DoMA) functions. A non-asymptotic local convergence analysis for ABGD is provided under sub-Gaussian covariate and noise distributions. To initialize ABGD, we adapt a prior algorithm originally developed for the simpler setting of max-affine functions. When suitably initialized, ABGD converges linearly to an $ε$-accurate estimate given $\tilde{\mathcal{O}}(d\max(σ_z/ε,1)^2)$ observations where $σ_z^2$ denotes the noise variance. This implies exact recovery given $\tilde{\mathcal{O}}(d)$ samples in the noiseless case. Also, such a rate is shown to be minimax optimal up to logarithmic factors. Synthetic numerical results corroborate the theoretical guarantees for ABGD. We also observe competitive performance compared to the state-of-the-art methods on real-world datasets.
An Exploration-by-Optimization Approach to Best of Both Worlds in Linear Bandits
In this paper, we consider how to construct best-of-both-worlds linear bandit algorithms that achieve nearly optimal performance for both stochastic and adversarial environments. For this purpose, we show that a natural approach referred to as exploration by optimization [Lattimore and Szepesvári, 2020b] works well.
AUnified Model and Dimension for Interactive Estimation
We study an abstract framework for interactive learning called interactive estimation in which the goal is to estimate a target from its "similarity" to points queried by the learner. We introduce a combinatorial measure called dissimilarity dimension which is used to derive learnability bounds in our model. We present a simple, general, and broadly-applicable algorithm, for which we obtain both regret and PAC generalization bounds that are polynomial in the new dimension. We show that our framework subsumes and thereby unifies two classic learning models: statistical-query learning and structured bandits. We also delineate how the dissimilarity dimension is related to well-known parameters for both frameworks, in some cases yielding significantly improved analyses.
Noise-Adaptive Thompson Sampling for Linear Contextual Bandits
Linear contextual bandits represent a fundamental class of models with numerous real-world applications, and it is critical to developing algorithms that can effectively manage noise with unknown variance, ensuring provable guarantees for both worst-case constant-variance noise and deterministic reward scenarios.
AThe Algorithm
Construct optimistic MDP fMk and compute optimistic policy πk (Algorithm 5). When the counter is 0 it gets (s,a), i.e., Ωi,e = (s,a,). When the counter is 1, we take (s,a) from ωn and map them to ωn/2 while eliminating half of the factors in consideration with the consistent scope Zi chosen by the policy (stored in factor 2d+ 1 + iof the state). It is handled similarly to the previous item, but considers the reward consistent scope zj chosen by the policy (stored in factor 3d+ 1 + j of the state). For i = 1,...,d, the i-th factor is taken from factor i of the previous state when the counter is not log n + 1, and otherwise performs the optimistic transition of factor i. Denote the value in the last factor of Ωi,e by ve, the policy's chosen scope by Zi (stored in factor 2d+ 1 + iof the state) and the policy's chosen next state direction by s0i (stored in factor d+ 1 + iof the state).